feat: add bmad-eval-runner skill with isolation, dependency staging, and full docs #84
Conversation
New skill for running a target skill's evals in a clean, isolated environment.
Supports both artifact evals (evals.json with expectations) and trigger evals
(triggers.json with should_trigger). Adapted from Anthropic's skill-creator
eval pipeline (run_eval.py, grader.md, generate_review.py).
Isolation strategy:
- Docker preferred: each eval runs in a fresh bmad-eval-runner:latest
container with HOME pointed at an empty in-container dir, no host
CLAUDE.md or auto-memory bleed-through. Image built on first run.
- Local fallback: ~/bmad-evals/<run-id>/<eval-id>/ with HOME overridden
  to a clean .home/ directory. Best-effort isolation; the user is warned.
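As a minimal sketch of the local-fallback launch (the helper name and directory handling are illustrative assumptions; the shipped logic lives in scripts/run_evals.py):

```python
import os
import subprocess
from pathlib import Path

def run_isolated(eval_dir: Path, prompt: str, timeout: int = 600):
    """Hypothetical helper: launch claude -p with HOME pointed at a clean per-eval dir."""
    home = eval_dir / ".home"  # clean HOME: no host CLAUDE.md or auto-memory bleed-through
    (home / ".claude").mkdir(parents=True, exist_ok=True)
    env = {**os.environ, "HOME": str(home)}
    return subprocess.run(
        ["claude", "-p", prompt,
         "--output-format", "stream-json", "--verbose",  # capture a stream-JSON transcript
         "--dangerously-skip-permissions"],              # lets the Skill tool read SKILL.md
        cwd=eval_dir / "workspace",
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```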
Artifacts (transcript, files Claude wrote, metrics, grading) are retained
permanently per run so users can review what happened, not just whether
it passed.
Layout:

```
SKILL.md                    outcome-driven entry
references/isolation.md     Docker + local strategies
references/eval-formats.md  evals.json + triggers.json schemas
scripts/run_evals.py        artifact runner
scripts/run_triggers.py     trigger runner (adapted from Anthropic)
scripts/docker_setup.py     Docker detection + image build
scripts/generate_report.py  aggregate HTML report
scripts/utils.py            shared helpers
agents/grader.md            judge subagent
assets/Dockerfile           clean Claude Code image
```
Three fixes from running the runner end-to-end against bmad-product-brief:

1. Stage Claude Code OAuth credentials into each isolated workspace. Both isolation modes override HOME, so the subprocess can't read the host's ~/.claude/, and the macOS Keychain ACL prevents it from reading the credential directly. The parent process (which owns the ACL) now reads "Claude Code-credentials" via `security find-generic-password` once at import, then writes it as .credentials.json into each workspace's .claude/ before launching claude -p. ANTHROPIC_API_KEY passthrough still works as a fallback for non-macOS hosts.

2. Trigger detection: place the synthetic skill at .claude/skills/<name>/SKILL.md instead of .claude/commands/<name>.md. Slash commands do not surface as Skill tool calls, which is why the previous implementation (matching Anthropic's reference run_eval.py) reported 0% trigger rates for every should-trigger query. Real skills under .claude/skills/ do fire the Skill tool, letting the existing detector observe genuine trigger events.

3. Docker credential mount: write to a dedicated <eval-dir>/creds/ directory so the container mount holds exactly one file at the expected path (`/creds/.credentials.json`). Mounting eval-dir directly would expose all run output and would have required the container to know an undocumented dot-prefix filename.

isolation.md and SKILL.md are updated to document the auth flow, the local-mode trigger leak (the host's installed skills can bleed in via cwd discovery despite the HOME override; prefer Docker for triggers), and why real-skill placement is correct versus slash-command placement. Multi-turn workflow handling for non-headless skills is still TODO.
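The credential flow in fix 1, as a minimal sketch (the `stage_credentials` name and the Keychain service name come from this PR; the function bodies are assumptions, not the shipped code):

```python
import subprocess
from pathlib import Path

def read_keychain_credentials() -> str | None:
    """Read the Claude Code OAuth JSON from the macOS Keychain.

    Runs in the parent process, which owns the Keychain ACL; called once at import.
    """
    try:
        out = subprocess.run(
            ["security", "find-generic-password",
             "-s", "Claude Code-credentials", "-w"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None  # non-macOS host: ANTHROPIC_API_KEY passthrough is the fallback

def stage_credentials(claude_dir: Path, creds_json: str | None) -> None:
    """Write the credential into the isolated workspace's .claude/ before claude -p runs."""
    if not creds_json:
        return
    claude_dir.mkdir(parents=True, exist_ok=True)
    (claude_dir / ".credentials.json").write_text(creds_json)
```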
- Setup overlay system: rsync evals/setup/ (base) and evals/<id>/setup/ (per-eval) onto each workspace before skill staging, enabling dependency skills and _bmad/ config to be available inside the sandbox (see the sketch after this list)
- Add parse_skill_dependencies, discover_setup_dirs, apply_setup_overlay to utils.py; wire through run_evals.py for both local and Docker modes
- Fix 0% trigger rate: add --dangerously-skip-permissions to all claude -p invocations in run_triggers.py (without it the Skill tool cannot read SKILL.md)
- Upgrade grader.md with richer transcript-parsing guidance (tool-call patterns, phase ordering, read-only enforcement, JSON block extraction)
- Expand eval-formats.md reference with setup overlay and dependency docs
- Bump default workers to 8
- Add pty_runner.py (experimental; not wired into the main flow)
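A minimal sketch of two of the overlay helpers named above, assuming base-then-per-eval ordering so per-eval files win on conflict (bodies are illustrative; the real implementations live in utils.py, and this sketch uses check=True so a missing rsync fails loudly):

```python
import subprocess
from pathlib import Path

def discover_setup_dirs(evals_root: Path, eval_id: str) -> list[Path]:
    # Base overlay first; the per-eval overlay is applied second and wins on conflict.
    return [evals_root / "setup", evals_root / eval_id / "setup"]

def apply_setup_overlay(setup_dirs: list[Path], workspace: Path) -> None:
    for src in setup_dirs:
        if not src.is_dir():
            continue
        # Trailing slash copies directory contents; check=True surfaces a
        # missing rsync binary or a copy failure instead of silently skipping.
        subprocess.run(["rsync", "-a", f"{src}/", f"{workspace}/"], check=True)
```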
…ad-eval-runner
- explanation/what-are-evals.md: artifact vs trigger evals; output vs transcript grading
- explanation/why-bmad-eval-runner.md: isolation, dependency staging, real triggers, permanent artifacts
- how-to/install-docker-for-evals.md: Docker Desktop setup with credential-safety notes
- how-to/run-evals-against-a-skill.md: 5-step run flow with the brief eval suite as a worked example
- reference/eval-format.md: complete schema for evals.json + triggers.json (fixtures, setup overlays, per-eval timeout); an example shape is sketched below
- _diagrams/eval-test-types.excalidraw: source diagram with Playwright renderer (render.mjs + render.html)
- public/img/eval-test-types.png: rendered architecture diagram embedded in what-are-evals.md
- update explanation/index.md and reference/index.md sidebars
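For orientation, minimal examples of the two formats. The wrapper keys and any field names other than `expectations`, `should_trigger`, and the per-eval `"timeout"` override (all named in this PR) are illustrative assumptions; reference/eval-format.md remains the authoritative schema. An evals.json entry might look like:

```json
{
  "evals": [
    {
      "id": "create-brief-basic",
      "prompt": "Create a product brief for a note-taking app.",
      "timeout": 900,
      "expectations": [
        "Output includes a problem statement",
        "A brief file is written to the workspace"
      ]
    }
  ]
}
```

and a triggers.json entry:

```json
{
  "triggers": [
    { "query": "help me write a product brief", "should_trigger": true },
    { "query": "what is the weather today?", "should_trigger": false }
  ]
}
```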
```python
workspace_snapshot_before = snapshot_files(workspace_project)
...
home_dir = workspace_root / ".home"
stage_credentials(home_dir / ".claude", _KEYCHAIN_CREDS)
```

Staging the macOS Keychain OAuth JSON into the per-eval run directory (workspace/.home/.claude/.credentials.json and eval_dir/creds/.credentials.json) appears to persist credentials in the "artifacts are forever" run folder, which is a significant secret-leak risk if runs are backed up or shared.

Severity: high

Other locations:
- skills/bmad-eval-runner/scripts/run_evals.py:232
- skills/bmad-eval-runner/scripts/run_triggers.py:148
- skills/bmad-eval-runner/scripts/run_triggers.py:225
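If the credentials must be staged, one possible mitigation is to scrub them from the run directory after the subprocess exits and before artifacts are retained; a hypothetical sketch using the paths named in the comment above:

```python
from pathlib import Path

def scrub_staged_credentials(eval_dir: Path) -> None:
    """Hypothetical cleanup: delete staged OAuth files so retained artifacts stay secret-free."""
    for rel in ("workspace/.home/.claude/.credentials.json",
                "creds/.credentials.json"):
        path = eval_dir / rel
        if path.exists():
            path.unlink()
```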
```python
if e.stderr:
    stderr_tail += "\n" + e.stderr.decode("utf-8", errors="replace")[-2000:]
...
new_files = diff_workspace(workspace_project, workspace_snapshot_before)
```

Local artifact capture only includes newly-created paths (after - before), so edits to existing files (e.g., Update/Validate flows) won't be reflected in artifacts/ for grading; in Docker mode the container script rsyncs the entire workspace (including the whole project), which can massively bloat runs and dilute what the skill actually produced.

Severity: medium

Other locations:
- skills/bmad-eval-runner/scripts/run_evals.py:259
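A hedged sketch of one way to capture edits as well as new files: snapshot content hashes rather than paths, so the diff reports changed files too (the function names match the diff context above, but these bodies are assumptions, not the shipped code):

```python
import hashlib
from pathlib import Path

def snapshot_files(root: Path) -> dict[str, str]:
    """Map each file's relative path to a hash of its contents."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

def diff_workspace(root: Path, before: dict[str, str]) -> list[str]:
    """Return new files plus files whose contents changed since the snapshot."""
    after = snapshot_files(root)
    return [rel for rel, digest in after.items() if before.get(rel) != digest]
```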
```python
for src in setup_dirs:
    if not src.is_dir():
        continue
    subprocess.run(
```

apply_setup_overlay() shells out to rsync unconditionally and ignores failures (check=False), so on hosts without rsync (or if rsync errors) overlays can silently fail to apply and dependency staging can break in hard-to-debug ways.

Severity: medium
```python
    pending_tool = name
    accumulated_json = ""
else:
    return False, ""
```

parse_stream_for_trigger() returns False immediately when it sees a tool_use that isn't Skill/Read (and also returns False after the first assistant event lacking the tool), which can create false negatives if the synthetic skill fires later in the stream.

Severity: medium
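A hedged sketch of a whole-stream scan that avoids the early return; the event shape (assistant events carrying message.content with tool_use blocks) is an assumption based on the stream-JSON transcripts this PR describes, not the shipped parser:

```python
import json

def parse_stream_for_trigger(stream_lines, skill_name: str):
    """Scan the entire stream-JSON transcript for a Skill tool_use,
    rather than bailing at the first non-matching tool."""
    for line in stream_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        for block in event.get("message", {}).get("content", []) or []:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use" and block.get("name") == "Skill":
                payload = json.dumps(block.get("input", {}))
                if skill_name in payload:
                    return True, payload
    return False, ""
```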
Summary

Adds the `bmad-eval-runner` skill plus complete documentation. The runner evaluates a skill's behavior in an isolated workspace (Docker preferred, local fallback) and grades the result against eval-author expectations.

Skill (5 commits)

- Core runner `bmad-eval-runner` with `claude -p`-based execution, isolation strategies, and discovery
- Synthetic skills staged at `.claude/skills/<unique>/SKILL.md` so the Skill tool can actually fire
- Base (`evals/setup/`) and per-eval (`evals/<id>/setup/`) setup directories rsynced into the workspace before the skill stages, enabling dependency skills to be available inside the sandbox
- `--dangerously-skip-permissions` added to `claude -p` invocations so the Skill tool can read SKILL.md (fixes 0% trigger rate)
- `evals.json` entries can set `"timeout": N` to override the runner's default

Docs

- `explanation/what-are-evals.md`: artifact vs trigger evals; output vs transcript grading; best practices; worked example pointing at `bmad-product-brief`
- `explanation/why-bmad-eval-runner.md`: isolation, dependency staging, trigger detection, permanent artifacts
- `how-to/install-docker-for-evals.md`: Docker Desktop setup with credential-safety notes
- `how-to/run-evals-against-a-skill.md`: 5-step run flow with worked example
- `reference/eval-format.md`: complete schema (fixtures, setup overlays, per-eval timeout)
- `_diagrams/eval-test-types.excalidraw` + Playwright renderer (`render.mjs` + `render.html`)
- `public/img/eval-test-types.png`: rendered architecture diagram

Test Plan

- Ran against `bmad-product-brief`: 17 artifact evals, all 17 ran; 16 passed, 1 timeout traced to a too-tight per-eval timeout and fixed by the new override field
- Verified setup overlays make dependency skills (`bmad-distillator`, editorial review skills) available inside the sandbox
- Rendered the architecture diagram via the Playwright renderer (`docs/_diagrams/render.mjs`)